For one of my programs, I need to filter and remove some words from a long text. I need to remove pronouns, conjugated verbs, determiners and some other words. other words.
The first way to do this would be to simply filter the word list wrdstxt
by testing the absence of each word in the wrdsfilt
list using notElem
:
filterByList (wrdsfilt, wrdstxt) = filter (`notElem` wrdsfilt) wrdstxt
For a small number of words used as a filter, this function would be suitable, but as the number of words to be deleted increase, the processing time will increase too1.
So we have to find a way to optimize the process.
The solution would be to store the words not in a list but in a search tree. The words sorted and organized in this way would be faster to search.
After searching for documentation on the subject, I found several solutions to answer my problem.
But what is the most efficient solution?
That's what we will see in this benchmark!
This test will be performed on a machine running Debian 10 Linux with an AMD Ryzen 7 3700X 8 core 16 thread processor with 32GB of DDR4 memory.
The benchmark will be done with Haskell and Criterion with all compiler optimizations enabled ghc -o2 -fforce-recomp
.
The list of words to be filtered will be this one: Mots_a_filtrer.txt
The text to scan will be Travel to the center of the earth retrieved from the Gutenberg site
The words will be loaded before the test, in order not to distort the results.
The search trees will be created during the test in order to take into account the time needed to create the tree.
The different solutions reviewed are :
The container package provides various standard modules for storing data and in particular the Set
.
This structure is a binary search tree (2 branches) automatically balanced when it is creation.
This tree is a tree of my own design containing a branch for each letter of the alphabet. The order of the trees corresponds to the alphabetical order: abcdefghijklmnopqrstuvwxyz
This tree can't be rebalanced by itself.
The member
function to test the presence of an element is implemented in this way:
member v0 tree = go v0 tree
where
go _ Tip = False
go _ t@(Fork vs Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
go [] t@(Fork vs _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ ) = v0 `elem` vs
go (a : as) (Fork vs t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 t20 t21 t22 t23 t24 t25 t26) | a == 'a' = go as t1
| a == 'b' = go as t2
| a == 'c' = go as t3
| a == 'd' = go as t4
| a == 'e' = go as t5
| a == 'f' = go as t6
| a == 'g' = go as t7
| a == 'h' = go as t8
| a == 'i' = go as t9
| a == 'j' = go as t10
| a == 'k' = go as t11
| a == 'l' = go as t12
| a == 'm' = go as t13
| a == 'n' = go as t14
| a == 'o' = go as t15
| a == 'p' = go as t16
| a == 'q' = go as t17
| a == 'r' = go as t18
| a == 's' = go as t19
| a == 't' = go as t20
| a == 'u' = go as t21
| a == 'v' = go as t22
| a == 'w' = go as t23
| a == 'x' = go as t24
| a == 'y' = go as t25
| otherwise = go as t26
go _ _ = False
The full module can be found at this adress : WordTree1.hs
This tree is of the same type as the WordTree1
tree except that the order of the trees corresponds to the order of the most commonly used letters in the French language2: eairsntolucmpdghbfvyqkzx
member v0 tree = go v0 tree
where
go _ Tip = False
go _ t@(Fork vs Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
go [] t@(Fork vs _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ ) = v0 `elem` vs
go (a : as) (Fork vs t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 t20 t21 t22 t23 t24 t25) | a == 'e' = go as t1
| a == 'a' = go as t2
| a == 'i' = go as t3
| a == 'r' = go as t4
| a == 's' = go as t5
| a == 'n' = go as t6
| a == 't' = go as t7
| a == 'o' = go as t8
| a == 'l' = go as t9
| a == 'u' = go as t10
| a == 'c' = go as t11
| a == 'm' = go as t12
| a == 'p' = go as t13
| a == 'd' = go as t14
| a == 'g' = go as t15
| a == 'h' = go as t16
| a == 'b' = go as t17
| a == 'f' = go as t18
| a == 'v' = go as t19
| a == 'y' = go as t20
| a == 'q' = go as t21
| a == 'k' = go as t22
| a == 'z' = go as t23
| a == 'x' = go as t24
| otherwise = go as t25
go _ _ = False
Le module complet se trouve ici : WordTree2.hs
This tree cannot rebalance itself.
This tree is a modification of the WordTree2
tree with a grouping of some letters in some branches of the tree. This grouping is done according to the occurrences of these letters and done with the elem
function.
The grouping of letters will be:
E
A
I
R
S
N
T
O
LU
CMP
DGHB
…
member v0 tree = go v0 tree
where
go _ Tip = False
go _ t@(Fork vs Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
go [] t@(Fork vs _ _ _ _ _ _ _ _ _ _ _ _ ) = v0 `elem` vs
go (a : as) (Fork vs t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12) | a == 'e' = go as t1
| a == 'a' = go as t2
| a == 'i' = go as t3
| a == 'r' = go as t4
| a == 's' = go as t5
| a == 'n' = go as t6
| a == 't' = go as t7
| a == 'o' = go as t8
| elem a "lu" = go as t9
| elem a "cmp" = go as t10
| elem a "dghb" = go as t11
| otherwise = go as t12
go _ _ = False
The full module can be found at this adress : WordTree2g.hs
This tree is an improvement of the WordTree2
tree. The order of the trees corresponds to the order of the letters most commonly used in the list of words to be filtered3:
member v0 tree = go v0 tree
where
go _ Tip = False
go _ t@(Fork vs Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
go [] t@(Fork vs _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ ) = v0 `elem` vs
go (a : as) (Fork vs t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 t20 t21 t22 t23 t24) | a == 'e' = go as t1
| a == 't' = go as t2
| a == 'u' = go as t3
| a == 's' = go as t4
| a == 'i' = go as t5
| a == 'a' = go as t6
| a == 'o' = go as t7
| a == 'r' = go as t8
| a == 'n' = go as t9
| a == 'l' = go as t10
| a == 'c' = go as t11
| a == 'd' = go as t12
| a == 'q' = go as t13
| a == 'p' = go as t14
| a == 'f' = go as t15
| a == 'v' = go as t16
| a == 'm' = go as t17
| a == 'x' = go as t18
| a == 'z' = go as t19
| a == 'h' = go as t20
| a == 'g' = go as t21
| a == 'j' = go as t22
| a == 'y' = go as t23
| otherwise = go as t24
go _ _ = False
The full module can be found at this adress : WordTree3.hs
This tree is a modification of the WordTree3
tree with a grouping of some letters in some branches of the tree. This grouping is done according to the occurrences of these letters and done with the elem
function.
The grouping of letters will be:
E
T
U
R
S
IA
OR
NLC
DQPFVMXZH
…
member v0 tree = go v0 tree
where
go _ Tip = False
go _ t@(Fork vs Tip Tip Tip Tip Tip Tip Tip Tip Tip) = v0 `elem` vs
go [] t@(Fork vs _ _ _ _ _ _ _ _ _ ) = v0 `elem` vs
go (a : as) (Fork vs t1 t2 t3 t4 t5 t6 t7 t8 t9) | a == 'e' = go as t1
| a == 't' = go as t2
| a == 'u' = go as t3
| a == 's' = go as t4
| elem a "ia" = go as t5
| elem a "or" = go as t6
| elem a "nlc" = go as t7
| elem a "dqpfvmxzh" = go as t8
| otherwise = go as t9
go _ _ = False
The full module can be found at this adress : WordTree3g.hs
The lists of words to be filtered and used as filters are read from a text file with readFile
then split into words, converted to lower case with map (map toLower) . words
and returned as a doublet.
This function is executed outside of the test to avoid taking into account the construction time of the of the lists in the test.
readWordsIO :: FilePath -> FilePath -> IO ([String], [String])
readWordsIO pathfil pathtxt = do
wrdsfil <- map (map toLower) . words <$> readFile pathfil
wrdstxt <- map (map toLower) . words <$> readFile pathtxt
return (wrdsfil, wrdstxt)
The filtering of the lists will be done with identical functions using the member
function function of each module.
Only the list fitting function is a bit different, using notElem
.
filterByList (wrdsfilt, wrdstxt) = filter (`notElem` wrdsfilt) wrdstxt
filterTextSet (wrdsfilt, wrdstxt) = filter (\w -> not (Set.member w tree)) wrdstxt
where tree = Set.fromList wrdsfilt
filterText1 (wrdsfilt, wrdstxt) = filter (\w -> not (WT1.member w tree)) wrdstxt
where tree = WT1.fromList wrdsfilt
The construction of the trees will be evaluated simply with the fromList
function of each module.
constSet (wrdsfilt, wrdstxt) = Set.fromList wrdsfilt
constTree1 (wrdsfilt, wrdstxt) = WT1.fromList wrdsfilt
The values given by the benchmark are the following:
Lists |
Set |
WordTree 1 |
WordTree 2 |
WordTree 2g |
WordTree 3 |
WordTree 3g |
|
---|---|---|---|---|---|---|---|
sec |
96,44437 |
8,73092 |
7,63710 |
6,94024 |
10,84359 |
6,83521 |
15,85130 |
xfaster/List |
/ |
11,0463 |
12,6284 |
13,8964 |
8,8941 |
14,1099 |
6,0843 |
%lower/List |
/ |
-90,95 % |
-92,08 % |
-92,80 % |
-88,76 % |
-92,91 % |
-83,56% |
xfaster/Set |
/ |
/ |
1,14 |
1,26 |
0,81 |
1,28 |
0,55 |
%lower/Set |
/ |
/ |
-12,53 % |
-20,51 % |
24,20 % |
-21,71 % |
81,55 % |
We already notice the difference between filtering by list and filtering with trees. The gain is enormous! The processing is 10 times faster than with the list.
We also notice that if the filtering with the Set
is very efficient, a filtering with a tree dedicated to each letter is a bit faster.
We can already see that the idea of grouping letters in some branches with the elem
function is a bad idea. These functions, although faster than sorting by list, are slower than sorting by Set
and especially slower than their non-grouped version.
We can also see that the WordTree2 tree is a bit faster than the WordTree1 tree and that the WordTree3 tree is a bit faster than the WordTree2.
It is therefore interesting to have a tree whose sorting is optimized for the word list used as a filter.
As you can see on this summary table, the construction time of a set is much more important than the WordTree1 tree (about 5 times slower). This can be explained very well because Sets
are trees that automatically balance themselves during their construction. This requires tests and modifications of the tree which leads to a slower construction.
We also notice that the more we use an adapted (optimized) tree for the words used as a filter, the lower its construction time.
Set |
WordTree 1 |
WordTree 2 |
WordTree 2g |
WordTree 3 |
WordTree 3g |
|
---|---|---|---|---|---|---|
sec |
0,04696 |
0,00831 |
0,00781 |
0,01192 |
0,00784 |
0,01606 |
xfaster/Set |
1,00 |
5,65 |
6,02 |
3,94 |
5,99 |
2,92 |
%lower/Set |
/ |
-82,30 % |
-83,38 % |
-74,62 % |
-83,31 % |
-65,80 % |
Using a tree filtering is extremely interesting and allows to speed up considerably compared to a filtering by list. The gain is of the order of 90%.
A tree adapted to the words used as a filter also allows to accelerate the processing. The gain being up to up to 20%/25% compared to a Set
.
Nevertheless, statistics of the letters used and to implement a tree optimized for this series takes time. series takes time.
For an occasional and standard use, the use of a Set
remains the simplest and fastest solution to implement.
For intensive use on a massive amount of data, using a dedicated tree can significantly speed up processing time.
Notes
1.↑ |
The processing time will increase exponentially. |
2.↑ |
Most used letters in the French language According to this page, the most common letters used in the French language are:
|
3.↑ |
A separate analysis showed that the most common letters in the search words are: \ [('e',133),('s',103),('u',102),('t',90),('i',85),('a',77),('o',61),('n',59),('r',49),('l',46),('c',34),('d',27),('q',21),('p',21),('v',21),('f',19),('m',18),('x',9),('g',9),('z',8),('j',5),('y',4),('h',4),('b',4)] \ [('a',77),('b',4),('c',34),('d',27),('e',133),('f',19),('g',9),('h',4),('i',85),('j',5),('l',46),('m',18),('n',59),('o',61),('p',21),('q',21),('r',49),('s',103),('t',90),('u',102),('v',21),('x',9),('y',4),('z',8)]
|